Update report: LEO-II version 1.5 



Christoph Benzmüller 1 and Nik Sultana 2 

1 Freie Universität Berlin, Germany 
2 Cambridge University, UK 



Abstract. Recent improvements of the LEO-II theorem prover are pre- 
sented. These improvements include a revised ATP interface, new trans- 
lations into first-order logic, rule support for the axiom of choice, detec- 
tion of defined equality, and more flexible strategy scheduling. 

1 Introduction 

It has been five years since the last system description of Leo-II [6], and during 
the last months various improvements have been made to the system. In this 
article we outline the current system and describe the recent improvements. 

2 System overview 

Leo-II is written in OCaml and implements a RUE calculus [12] which relies on a 
'Boolean aware' (or, more generally, 'theory aware' [3]) extensional preunification 
engine. Leo-II accepts problems encoded in the CNF (clausal first-order form) 
and FOF (first-order form) languages from the TPTP [15], but its principal 
input language is THF0, core typed higher-order form [16]. 

The logical organisation of the prover is illustrated in Figure 1, and this 
roughly corresponds to the modular organisation of the code. It is structured 
into four layers, as the figure shows: 

Operating mode. The prover can be operated in two ways: (i) Leo-II can 
be used as a proof assistant when run in interactive mode. It provides a 
command interface through which the user can inspect and manipulate the 
prover's state, making calls to the calculus' rules as needed. This mode is 
very valuable for exploring logical problems and for debugging the prover's 
automatic mode. (ii) The prover is usually run in automatic mode, which 
comprises a set of strategy schedules and a main loop that drives applications 
of the calculus' rules. 

Prover interface. Both modes use a common infrastructure: they parse a 
problem and load it into the prover's state, then further manipulate the 
state by executing commands. A command might involve carrying out an 
inference, inspecting the state, switching flags, calling external provers, etc. 
Each command makes calls to lower levels of the prover. 

Fig. 1. Leo-II's architecture 



Logic. The main component in this level consists of the calculus: a collection of 
functions which accept and return clauses. This level also contains Leo-II's 
main loop, and an interface to external ATPs (which also translates problems 
to other formats). 

Basis. The lowest level of Leo-II defines the representation of terms and types, 
and associated operations (e.g. substitution, unification, matching, etc.). 



3 Improvements 

The TPTP problem set is the canonical benchmark by which theorem provers 
are presently evaluated. We accompany the description of improvements in this 
section with TPTP problem names whose solution is affected by the feature. 
These problems consist of THF problems drawn from TPTP 5.4.0. We have 
used E version 1.6 as the backend ATP. Our tests were run on a 2GHz AMD 
Opteron with 4GB RAM, with a 60-second timeout. Leo-II was compiled 
with OCaml 3.11.2. 



3.1 ATP interface 

Leo-II cooperates with other provers in order to maximise its potential. We 
improved Leo-II's translation to FOL in recognition of this. Version 1.5 includes 
a better translation into FOF, an experimental translation into TFF [14], and 
supports additional backend ATPs. 

Translation into FOL. Alongside the old translations which were previously 
implemented in Leo-II, version 1.5 features a new translation module which 
was written from scratch. This module contains an intermediate language to 
which problems are first translated, before being transformed further and printed 
into a specific target syntax. HOL-to-FOL translations consist of a pipeline of 
functions which bring HOL formulas into this intermediate language, applying 
analyses and transformations along the way. We are also experimenting with 
lighter encodings of type information. We have closely followed Claessen et al. [7] 
to implement their monotonicity analysis by producing a SAT encoding, which 
we send to MiniSat using an interface adapted from Satallax [2]. 

Leo-II's old and new FOF encodings can be used via the command-line 
arguments --translation fully-typed and --translation fof_full respectively. 
The gain of fof_full over fully-typed is due to improved handling of 
formulas: for instance, the new FOF translation implements full λ-lifting, which 
the old translation did not. The fof_full translation is now set as default. 
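
As a small illustration of what λ-lifting does, consider a formula containing the abstraction $\lambda X.\, f\,X\,Y$ with free variable $Y$. The symbols $p$, $f$ and the fresh function symbol $g$ are ours, chosen only for illustration; they are not taken from Leo-II's output. λ-lifting replaces the abstraction by a fresh symbol applied to its free variables and adds a defining axiom:

```latex
\[
  p\,(\lambda X.\, f\,X\,Y)
  \quad\leadsto\quad
  p\,(g\,Y)
  \quad\text{together with}\quad
  \forall Y.\ \forall X.\ (g\,Y)\,X = f\,X\,Y
\]
```

After this step the formula no longer contains a λ-binder, which is what a first-order target syntax requires.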



Backend ATPs. Leo-II is mainly used in combination with E [11], and 
version 1.5 features small improvements in how it interacts with E. We also 
improved Leo-II's ATP interface and added support for various other 
backend ATPs, including remote provers on SystemOnTPTP [15]. 



3.2 Support for Axiom of Choice 

The default semantics for THF0 is Henkin semantics with choice. Until version 
1.5, Leo-II did not support reasoning with choice, unless naive Skolemization was 
used, that is, first-order Skolemization without employing further restrictions 
(as investigated by Miller [5]). Naive Skolemization enables limited reasoning 
with choice, and succeeds in some example cases, but it fails in many others 
[5, Section 3.2]. 

In order to extend Leo-II to support the axiom of choice (AC), instances 
of AC could be automatically added to the input problem. An example is the 
following instance of AC for type $(\iota \to o) \to \iota$: 

\[ \exists E_{(\iota \to o) \to \iota}.\ \forall P_{\iota \to o}.\ (\exists X_\iota.\ P\,X) \Rightarrow P\,(E\,P) \tag{1} \]

However, such kinds of impredicative axioms should generally be avoided in 
automated proof search since they allow for simulation of the cut rule in any 
Henkin-complete THF prover [4]. 

Our approach involves adding two new rules to Leo-II: detectChoiceFn and 
choice. The first rule detects and removes instances of AC, such as (1) above, and 
keeps a register of choice functions CFs. CFs always contains at least one choice 
function symbol for each choice type. The second rule gives the semantics to 
choice functions. Taken together, these rules allow AC-valid reasoning without 
the risk of cut-simulation. 

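In more detail, rule detectChoiceFn removes choice-axiom clauses from the 
search space and registers the corresponding choice function symbols $f$ in CFs. 

\[ \frac{[P\,X]^{\mathit{ff}} \lor [P\,(f_{(\alpha \to o) \to \alpha}\,P)]^{\mathit{tt}}}{\mathit{CFs} \leftarrow \mathit{CFs} \cup \{f_{(\alpha \to o) \to \alpha}\}}\ \textit{detectChoiceFn} \]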
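
The shape test performed by detectChoiceFn can be sketched as follows. This is only an illustration under simplifying assumptions, not Leo-II's OCaml implementation: terms are untyped (so the type annotation $(\alpha \to o) \to \alpha$ is not checked), and the names `Var`, `Const`, `App`, `Literal`, `detect_choice_fn` and `apply_detect_choice_fn` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    name: str


@dataclass(frozen=True)
class App:
    fn: "Term"
    arg: "Term"


Term = Union[Var, Const, App]


@dataclass(frozen=True)
class Literal:
    atom: Term
    positive: bool  # True stands for polarity tt, False for ff


def detect_choice_fn(clause: List[Literal]) -> Optional[Const]:
    """Return f if the clause has the shape [P X]^ff v [P (f P)]^tt, else None."""
    neg = [lit for lit in clause if not lit.positive]
    pos = [lit for lit in clause if lit.positive]
    if len(clause) != 2 or len(neg) != 1 or len(pos) != 1:
        return None
    n, p = neg[0].atom, pos[0].atom
    # Negative literal: P X, with P and X distinct variables.
    if not (isinstance(n, App) and isinstance(n.fn, Var)
            and isinstance(n.arg, Var) and n.fn != n.arg):
        return None
    pred = n.fn
    # Positive literal: P (f P), with the same variable P and a constant f.
    if not (isinstance(p, App) and p.fn == pred and isinstance(p.arg, App)):
        return None
    inner = p.arg
    if isinstance(inner.fn, Const) and inner.arg == pred:
        return inner.fn
    return None


def apply_detect_choice_fn(clauses: List[List[Literal]],
                           choice_fns: Set[Const]) -> List[List[Literal]]:
    """Drop choice-axiom clauses and register their function symbols in choice_fns."""
    remaining = []
    for clause in clauses:
        f = detect_choice_fn(clause)
        if f is None:
            remaining.append(clause)
        else:
            choice_fns.add(f)
    return remaining
```

In this sketch the register of choice functions is a plain mutable set that the caller owns, mirroring the update CFs ← CFs ∪ {f} in the rule.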

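Rule choice investigates whether a term $\epsilon_{(\alpha \to o) \to \alpha}\,B_{\alpha \to o}$ (where $\epsilon \in \mathit{CFs}$ is a 
registered choice function or a free variable) is contained as a subterm of a 
literal $[A]^p$ in a clause $C$. In this case it adds the instantiation of AC at type 
$(\alpha \to o) \to \alpha$, and with term $B$, to the search space. Side-conditions guard 
against unsound reasoning, such as the 'uncapturing' of free variables in $B$: 

\[ \frac{C := C' \lor [A[E_{(\alpha \to o) \to \alpha}\,B]]^{p} \qquad \epsilon \in \mathit{CFs},\ E = \epsilon \text{ or } E \in \mathit{freeVars}(C),\ \mathit{freeVars}(B) \subseteq \mathit{freeVars}(C),\ Y \text{ fresh}}{[B\,Y]^{\mathit{ff}} \lor [B\,(\epsilon_{(\alpha \to o) \to \alpha}\,B)]^{\mathit{tt}}}\ \textit{choice} \]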

Rules detectChoiceFn and choice are obviously sound: detectChoiceFn simply 
removes clauses from the search space, and for any choice function $f$, the rule 
choice only introduces new instances of the corresponding choice axiom. 

There is a correspondence with the handling of choice in Satallax. Satallax too 
considers only selective instantiations of AC in order to avoid cut-simulation. For 
instance, when (1) is assumed, the terms $T$ which Satallax considers to be eligible 
instantiations for variable $P$ are those occurring in formulas of the following 
forms in a tableau branch (where $\epsilon$ is a choice function): $(\epsilon\,T)\,S_1 \ldots S_n$ or 
$\neg((\epsilon\,T)\,S_1 \ldots S_n)$, or the disequations $(\epsilon\,T)\,S_1 \ldots S_n \neq S$ or $S \neq (\epsilon\,T)\,S_1 \ldots S_n$. 
It is easy to see that our rule choice, which is less restrictive, subsumes these 
cases. We also experimented with Satallax's approach in Leo-II but this led to 
worse results. Our choice rule is more closely related to that of Mints [5]. Use of 
the choice rules can be disabled using the -nuc command-line switch. 



3.3 Detection of defined equality 

Primitive equality in HOL refers to the use of the interpreted constant '='. 
Equality can also be defined in HOL, for example as 
$\lambda X_\alpha\, \lambda Y_\alpha\, \forall P_{\alpha \to o}.\ P\,X \Rightarrow P\,Y$ or 
$\lambda X_\alpha\, \lambda Y_\alpha\, \forall Q_{\alpha \to \alpha \to o}.\ (\forall Z_\alpha\, (Q\,Z\,Z)) \Rightarrow Q\,X\,Y$. The former is known 
as Leibniz equality and the latter we call Andrews equality (cf. [1], Exercise 
X5303). Both Leibniz and Andrews equality support cut-simulation due to their 
impredicative nature [4], and should thus be avoided in proof automation. In 
fact, using primitive, rather than defined, equality may save many primitive 
substitution steps in proofs. Such steps involve instantiations of set variables, and 
this generally involves blind guessing. Examples of the benefit of using primitive, 
rather than defined, equality have been given in the literature [5, Sections 5.1 and 
5.2]. In order to address this issue we added the following two rules to Leo-II's 
calculus; they instantiate the set variable $P$ with primitive equality: 

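\[ \frac{C \lor [P\,A]^{\mathit{ff}} \lor [P\,B]^{\mathit{tt}}}{C\{\lambda X.\ A = X / P\} \lor [A = B]^{\mathit{tt}}}\ \textit{LeibEQ} \qquad \frac{C \lor [P\,A\,A]^{\mathit{ff}}}{C\{\lambda X\, \lambda Y.\ X = Y / P\}}\ \textit{AndrEQ} \]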
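
For illustration, consider how LeibEQ acts on a hypothetical clause; the constants $a$, $b$ and the side literal $[q\,b]^{\mathit{ff}}$ are ours and are not drawn from a TPTP problem. Substituting $\lambda X.\ a = X$ for $P$ and β-reducing yields the literal $[a = a]^{\mathit{ff}}$, which is trivially false and can be dropped, leaving exactly the conclusion of the rule:

```latex
\begin{align*}
  & [q\,b]^{\mathit{ff}} \lor [P\,a]^{\mathit{ff}} \lor [P\,b]^{\mathit{tt}} \\
  \{\lambda X.\ a = X / P\}\colon\quad
  & [q\,b]^{\mathit{ff}} \lor [(\lambda X.\ a = X)\,a]^{\mathit{ff}} \lor [(\lambda X.\ a = X)\,b]^{\mathit{tt}} \\
  \to_\beta\quad
  & [q\,b]^{\mathit{ff}} \lor [a = a]^{\mathit{ff}} \lor [a = b]^{\mathit{tt}} \\
  \text{simplify}\quad
  & [q\,b]^{\mathit{ff}} \lor [a = b]^{\mathit{tt}}
\end{align*}
```

The set variable $P$ has disappeared from the result, so no further guessing of instantiations for $P$ is needed on this clause.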

Soundness of LeibEQ and AndrEQ is obvious, since both rules simply realise 
specific instances of primitive substitution. For improved configurability, either 
rule can be individually disabled from the command line by using the switches 
-nrleq and -nraeq respectively. If LeibEQ is used in combination with the 
new FOF translations (see Section 3.1), several TPTP problems whose previous 
SZS [13] status was 'Unknown' can now be solved by Leo-II. Examples include 
SYO246^5.p, SYO244^5.p, NUM817^5.p, NUM816^5.p, and NUM814^5.p. 



There are also many problems that can now be solved with primitive substitution 
(blind guessing) disabled when LeibEQ and AndrEQ are available. Overall, these 
two new rules lead to significantly better coverage using the lighter primitive- 
substitution search modes -ps or -ps 1. 

3.4 Strategy scheduling 

Strategy schedules were added to Leo-II in version 1.2 and the catalogue of 
schedules has slowly increased in the versions that followed. In version 1.5 we re- 
coded the strategy-scheduling feature to facilitate the encoding of new strategies, 
to improve code reuse with other parts of Leo-II, and to have greater flexibility 
when encoding strategies. 

We are also interested in computing strategies on-the-fly based on problem 
characteristics, and version 1.5 carries out some small initial checks (e.g. size of 
the problem, and whether it contains instances of AC), and schedules strategies 
based on that limited analysis. Optimising this further remains as future work. 
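
The idea of choosing a schedule from a few cheap problem checks can be sketched as below. This is a minimal sketch under our own assumptions and does not reproduce Leo-II's scheduler: the strategy names, weights, the axiom-count threshold, and the functions `pick_schedule`, `time_slices` and `run_schedule` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Strategy:
    name: str      # hypothetical label, not an actual Leo-II strategy name
    weight: float  # relative share of the overall time budget


@dataclass(frozen=True)
class ProblemFeatures:
    num_axioms: int
    has_choice_axiom: bool


def pick_schedule(features: ProblemFeatures) -> List[Strategy]:
    """Build a schedule from a limited analysis of the problem."""
    schedule = [Strategy("default", 2.0), Strategy("no_extensionality", 1.0)]
    if features.has_choice_axiom:
        # Try the choice-aware strategy first when AC instances are present.
        schedule.insert(0, Strategy("choice_rules", 2.0))
    if features.num_axioms > 100:  # arbitrary illustrative threshold
        schedule.append(Strategy("axiom_selection", 1.0))
    return schedule


def time_slices(schedule: List[Strategy],
                timeout_s: float) -> List[Tuple[Strategy, float]]:
    """Split the timeout among strategies in proportion to their weights."""
    total = sum(s.weight for s in schedule)
    return [(s, timeout_s * s.weight / total) for s in schedule]


def run_schedule(schedule: List[Strategy], timeout_s: float,
                 attempt: Callable[[Strategy, float], Optional[str]]) -> Optional[str]:
    """Run strategies in order; stop at the first one that returns a status."""
    for strategy, budget in time_slices(schedule, timeout_s):
        status = attempt(strategy, budget)
        if status is not None:
            return status
    return None
```

Here `attempt` stands in for one run of the prover's main loop under a given strategy and time budget; it returns an SZS status string on success and `None` otherwise.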

3.5 Other improvements 

Numerous other additions were made to Leo-II. Previously, Leo-II was entirely 
focused on refutation: that is, until version 1.5, in terms of the SZS 
classification, Leo-II would judge a problem to be a Theorem (if a refutation exists), 
Unsatisfiable (if the problem's axioms themselves can be refuted), or diverge (by 
extending the preunification depth and reattempting a refutation). It can now 
classify Satisfiable problems and detect CounterSatisfiable problems, thus 
improving both Leo-II's precision and termination behaviour. The added support 
for choice was very relevant for achieving this. 

Leo-II's unification algorithm has been redone, and can be set (from the 
command line) to disregard Boolean and functional extensionality. This has 
strengthened Leo-II's behaviour in non-extensional problems, since disabling 
the extensional behaviour shrinks the search space. 

Numerous other improvements and fixes have been made: these range from 
system features (such as the parser, status reporting, and avoiding redundant 
computations) to deeper areas in the calculus and main loop (including 
factorisation, subsumption, and clause selection). 

4 Future work 

We have started experimenting with using term orderings to influence literal 
selection. We also plan to revise Leo-II's internals to make full use of the potential 
benefit they offer. For instance, the shared term graph is currently underutilised. 

More work is needed to compute better schedules, paired with better problem 
analyses. Such analyses can determine the scheduling of specific strategies, which 
can be better tuned to the problem. 



SZS Status   fully-typed   fof_full   fof_experiment 
Thm          6X8           64.9       65.3 
All          60.9          61         61.3 

Table 1. Comparing FOL encodings in Leo-II 1.5 (30s timeout). Table shows the 
percentage of matches between Leo-II's SZS output and the 'Status' field of problems. 

Timeout (s)   v1.2           v1.4.3         v1.5 
              Thm    All     Thm    All     Thm    All 
30            58.1   50      62.1   54.1    64.3   61.3 
60            58.7   51.3    65     56.9    67.1   62.9 

Table 2. Percentage match between different versions of Leo-II and the Status field 
of TPTP problems. Leo-II version 1.2 was the winner of the CASC competition 
in 2010, and version 1.4.3 was the last public release. Version 1.5 was run with the 
fof_experiment encoding. 



The ATP interface can be improved further to call multiple backend ATPs 
in parallel. Experiments comparing 30-second invocations of Leo-II on all THF 
problems, supported by provers E (version 1.6), SPASS (version 3.5) [17] and 
Vampire (version 2.6) [10], showed us that there were 37, 5 and 20 theorems that 
were proved exclusively by Leo-II(E), Leo-II(SPASS) and Leo-II(Vampire), 
respectively. Conversely, there were 31, 95 and 98 theorems that Leo-II(E), 
Leo-II(SPASS) and Leo-II(Vampire), respectively, missed but which one of the 
others could prove. 
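
Counts of this kind can be derived from per-backend sets of proved problems with plain set operations. The sketch below assumes such sets are available; the function name `exclusive_and_missed` and the toy problem names in the usage note are made up for illustration and are not the experimental data.

```python
from typing import Dict, Set, Tuple


def exclusive_and_missed(
    proved: Dict[str, Set[str]]
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """For each backend, return (proved only by it, missed by it but proved by another)."""
    result = {}
    for backend, solved in proved.items():
        # Union of everything proved by the other backends.
        others = set().union(*(s for b, s in proved.items() if b != backend))
        result[backend] = (solved - others, others - solved)
    return result
```

For example, with `proved = {"E": {"p1", "p2", "p3"}, "SPASS": {"p2"}, "Vampire": {"p2", "p4"}}`, the entry for `"E"` is `({"p1", "p3"}, {"p4"})`: two problems proved exclusively with E, and one that E missed but another backend proved.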

Supporting various ATP backends increases the scope for peephole 
optimisation; we have not yet investigated this. The translation module can be optimised 
further, and extended to target more formats. Table 1 shows how the new 
HOL-to-FOL translation (fof_full) and its lighter variant (fof_experiment) 
are superior to Leo-II's preexisting encoding (fully-typed). In future work we 
plan to improve fof_experiment further and make it the default translation. 

5 Conclusion 

Version 1.5 of Leo-II includes various improvements which affect its performance 
and completeness. To obtain a broader picture, we compared the results of using 
Leo-II version 1.5 with earlier versions, and the results are shown in Table 2. 
In this experiment we counted the matches between Leo-II's SZS output and 
the TPTP problem's SZS status (included in its header).³ All the net gains 
are positive, but a more thorough evaluation (on different benchmarks, and 
considering various parameters) remains as future work. Within a 30s timeout, 
Leo-II version 1.5 can classify 196 more problems than its predecessor. The main 
boost in this version is provided by the detection of non-theorems (Section 3.5). 
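
The percentage-match measure used in the tables can be sketched as follows. This is our reading of the measure, not the evaluation script used for the experiments: we assume the 'Thm' rows restrict attention to problems whose header status is Theorem, and the function name `match_percentage` and its parameters are hypothetical.

```python
from typing import Dict, Optional


def match_percentage(expected: Dict[str, str], reported: Dict[str, str],
                     only_status: Optional[str] = None) -> float:
    """Percentage of problems whose reported SZS status equals the header status.

    If only_status is given, only problems with that header status are counted.
    """
    names = [n for n, s in expected.items()
             if only_status is None or s == only_status]
    if not names:
        return 0.0
    hits = sum(1 for n in names if reported.get(n) == expected[n])
    return 100.0 * hits / len(names)
```

Under this measure a problem whose header status is 'Unknown' but which the prover reports as 'Theorem' is counted as a mismatch.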



3 This also means that 'Unknown' problems which Leo-II now classifies as 'Theorem' 
count against us, but this experiment was only intended to offer a rough idea of 
progress. 